Abstract

Microsatellites or simple sequence repeats (SSRs) are repetitive stretches of nucleotides (A, T, G, C) that 
occur either as single base pair stretches or as combinations of two- to six-nucleotide units and that 
are non-randomly distributed within coding and non-coding regions of the genome. ChloroMitoSSRDB 
is a complete curated web-oriented relational database of perfect and imperfect repeats in organelle 
genomes. The present version of the database contains perfect and imperfect SSRs of 2161 organelle 
genomes (1982 mitochondrial and 179 chloroplast genomes). We detected a total of 5838 chloroplast 
perfect SSRs, 37 297 chloroplast imperfect SSRs, 5898 mitochondrial perfect SSRs and 50 355 mitochondrial 
imperfect SSRs across these genomes. The repeats have been further hyperlinked to the annotated 
gene regions (coding or non-coding) and to the corresponding gene records at the National Center for 
Biotechnology Information (www.ncbi.nlm.nih.gov/) to identify and understand the positional relationship 
of the repetitive tracts. ChloroMitoSSRDB is connected to a user-friendly web interface that provides 
useful information associated with the location of the repeats (coding and non-coding), size of repeat, 
motif and length polymorphism, etc. ChloroMitoSSRDB will serve as a repository for developing functional 
markers for molecular phylogenetics and for estimating molecular variation across species. Database URL: 
ChloroMitoSSRDB can be accessed as an open source repository at www.mcr.org.in/chloromitossrdb. 
1. Introduction 

Microsatellites, or simple sequence repeats (SSRs), 
are repetitive stretches of a tandemly repeated motif 
of one to six base pairs, which have evolved and 
expanded owing to the replication slippage mechanism 
that is thought to be the cause of their high 
polymorphic rates. 1 Recently, using a genome-wide 
alignment of the two Oryza sativa varieties indica and 
japonica, it has been demonstrated that the distribution 
of microsatellites is also influenced by the motif 
sequence and the sequence characteristics of the 
adjoining regions possessing the microsatellites, in 
addition to the replication slippage and point mutation 
model. 2 These repetitive stretches may occur in 
coding and in non-coding regions of the genome. 
SSRs have been designated as a class of 
co-dominant markers for evaluating germplasm and 
establishing phylogenetic and evolutionary relationships. 
It has been observed that clusters of microsatellite 
motifs with moderate GC content are abundant on 
chromosome 2 in the model plant 
Arabidopsis thaliana, which suggests that repetitive 
stretches may be biased towards accumulation 
in certain regions. 3 Microsatellites have been associated 
with various functional roles, such as possible roles 
in the regulation of promoters, transcription 
and translation, and these sequence repeats have 
been credited with evolutionary importance. 4-6 The 
positioning of microsatellites in the genome seems 
to play an important role in their regulatory activity; 
hence, studying the distribution of microsatellites and 
understanding the possible reasons for their expansion 
across genomes is currently the focus of 
intense research. 

Organelle genomes, namely plant chloroplast and animal 
mitochondrial genomes, have been referred to as 
natural counterparts. 7,8 Features such as conserved 
gene order, lack of heteroplasmy (occurrence of 
more than one type of organelle genome), low 
recombination rates and relatively small size have 
made these organelle genomes widely used 
tools for phylogenetic studies. However, lack of 
heteroplasmy has not been universally observed in all 
mitochondrial genomes, and the occurrence of 
heteroplasmy and the factors affecting its stoichiometry 
in mitochondrial genomes of plants and animals have 
been reviewed earlier. 9 The uniparental 
inheritance of the organelle markers 
provides a means to elucidate the genetic flow and 
genetic structure of the population, and the organelle 
markers have been widely used in population studies 
(for a review see Provan et al.). 8 In silico development 
of SSRs of organelle genomes has established them 
as potential markers owing to their transferability among 
species, ease of development and role as key players in 
genome length variation. They have been widely 
demonstrated as potential markers for establishing 
molecular evolutionary histories, demographic diversity 
and resolving phylogeny in a wide variety of 
species from Pinus (forest species) to Oryza sativa 
(Monocots). 10-12 There have been recent reports on 
the identification of perfect repeats in organelle 
genomes of various organisms. 11-16 However, previous 
studies have focused on a relatively 
small number of genomes, and only perfect repeats 
have been identified. A proper characterization 
system that would allow researchers to search for 
the association of these repeats with the coding or 
non-coding regions has been lacking in these reports. 

In the past few years, systematic curated web repositories 
have been developed for the organelle genomes. 
These include FUGOID, which displays the curated distribution 
of introns in organelle genomes with functional 
and structural data. 17 A database of universally published 
primer sequences of chloroplast genomes has 
been developed, providing a platform for studying 
molecular variations and evolution in chloroplasts. 
These organelle genomes have been exploited further 
for the mining of genes, exons, introns, gene products, 
taxonomy, RNA editing sites, SNPs and haplotype 
information, all of which are displayed as curated 
information in GOBASE. 19 A comprehensive repository 
of unique proteins expressed in the chloroplast proteome 
using liquid chromatography-mass spectrometry/mass 
spectrometry has been developed (AT_CHLORO), 
serving as a knowledge base to explore the 
envelope proteins. 20 However, a complete curated 
web-oriented integrated repository of repeat patterns 
is still lacking. This has motivated us to undertake a 
genome-wide study and to develop a web-enabled 
interface to analyse the perfect and the imperfect 
repeats in organelle genomes. 

We propose ChloroMitoSSRDB that offers a wide 
visualization of perfect and imperfect repeats across 
the chloroplast and mitochondrial genomes with 
corresponding genomic coordinates. The aim of 
ChloroMitoSSRDB is to constitute a platform to access 
the utility of SSRs as markers for phylogenetic classifica- 
tion across species. To our knowledge, this is the first 
updated integrated repository of the genomic repeats 
in chloroplast and mitochondrial genomes accessible 
via web interface. 



2. Material and methods 

2.1. Genome data retrieval and pattern search 

All the studied chloroplast (179) and mitochondrial 
(1982) genomes were retrieved from the National 
Center for Biotechnology Information (NCBI) RefSeq 
database (www.ncbi.nlm.nih.gov/). The required files 
such as gbk, fna, faa, gff and ptt were downloaded for 
the studied chloroplast and mitochondrial genomes 
and were stored as flat files sorted for each genome. 
For the identification of the perfect and imperfect 
repeats, the software tool Imperfect Microsatellite 
Extractor (IMEx) 21 has been used, which uses a sliding 
window algorithm to identify the regions with a repeti- 
tive stretch of a particular nucleotide motif either 
stretched perfectly or with levels of imperfection. 

The algorithm allows the user to specify the minimal 
length of the consecutive nucleotide stretch and 
reports the SSR motif, motif repeat counts, coordinates 
of the SSRs tract in the genome and its location relative 
to coding and non-coding regions. The association of 
the repeats with coding and non-coding regions was 
determined based on the sequence annotation information 
available in the GenBank database (NCBI, 
www.ncbi.nlm.nih.gov). We applied the following minimum 
repeat counts to define each SSR as a true repeat: 
mono-, 12; di-, 6; tri-, 4; and tetra- to hexanucleotide 
repeats, 3. In the case of imperfect repeats, the 
parameter for imperfection percentage (p%) was set to 
10%, indicating the level of imperfection allowed in 
each repeat tract. 

Figure 1. Schematic illustration showing the flow of the organization of the data in ChloroMitoSSRDB. 
Flow shown in the figure: NCBI chloroplast and mitochondrial genomes (RefSeq); genome-level extraction of microsatellites (gSSRs) with motif lengths of 1-6 bp; perfect SSRs and imperfect SSRs; coding analysis; MySQL relational database management; Apache web server. 
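The length criteria above can be illustrated with a short sketch. This is not the IMEx algorithm: it is a regular-expression stand-in that only finds perfect repeats, and the names `MIN_REPEATS`, `is_primitive` and `find_perfect_ssrs`, as well as the example sequence, are hypothetical. It assumes that a motif made of a shorter repeated unit (for example 'AA' or 'ATAT') should be reported only under its shortest unit.

```python
import re

# Minimum repeat counts per motif length, as stated in the text:
# mono-, 12; di-, 6; tri-, 4; tetra- to hexanucleotide, 3.
MIN_REPEATS = {1: 12, 2: 6, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(motif):
    """Return True if the motif is not itself a repeat of a shorter unit (e.g. 'ATAT')."""
    n = len(motif)
    return not any(n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n))


def find_perfect_ssrs(sequence):
    """Return (start, end, motif, iterations) tuples with 1-based inclusive coordinates."""
    sequence = sequence.upper()
    hits = []
    for size, min_reps in MIN_REPEATS.items():
        # One motif of `size` bases followed by at least (min_reps - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_reps - 1))
        for match in pattern.finditer(sequence):
            motif = match.group(1)
            if not is_primitive(motif):
                continue  # e.g. a poly-A run matched as a dinucleotide 'AA' repeat
            iterations = len(match.group(0)) // size
            hits.append((match.start() + 1, match.end(), motif, iterations))
    return sorted(hits)


# Hypothetical test sequence containing one mono-, one di- and one tetranucleotide SSR.
example = "GGC" + "A" * 12 + "CGG" + "AT" * 6 + "GCT" + "CAAC" * 3 + "TTG"
for hit in find_perfect_ssrs(example):
    print(hit)
```

Imperfect repeats need an alignment-based search with a tolerance such as the 10% used here, which a plain regular expression cannot express; that part is left to a dedicated tool such as IMEx.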

3. Results and discussions 

3.1. Structure of ChloroMitoSSRDB database 

ChloroMitoSSRDB is hosted on a 32-bit Linux server 
pre-installed with MySQL (http://www.mysql.com/), 
Apache (http://www.apache.org/) and PHP 
(http://www.php.net/), a combination commonly called LAMP. A flow 
chart explaining the organization and the work flow 
of ChloroMitoSSRDB is presented in Fig. 1. 
ChloroMitoSSRDB is based on a simple comprehensive 
relational database management system, MySQL, that 
is sufficient for organizing, storing and retrieving the 
data with a single query. The details of the relational 
MySQL tables used in the construction of the 
ChloroMitoSSRDB database are explained in Tables 1 
and 2. Table 1 shows the metadata for each genome, 
whereas the structure of the MySQL relational tables 
depicting the repeat information stored for the 
coding and the non-coding regions is given in 
Table 2. Each query has been split into hierarchical 
levels of information that displays information on 
each Genome (e.g. accession, sequence length and 
nucleotide composition) (Table 1). 
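The relational layout described above can be sketched as follows. This is an illustration under stated assumptions, not the database's actual DDL: SQLite from the Python standard library stands in for MySQL so that the example is self-contained, only a subset of the fields listed in Tables 1 and 2 is included, the composite primary key is inferred from the PRI marks in Table 2, and the inserted rows are invented for illustration.

```python
import sqlite3

# SQLite (standard library) stands in for MySQL so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE chloromitometa (
    acc_no     INTEGER,
    seq_id     VARCHAR(11) PRIMARY KEY,
    seq_name   VARCHAR(500),
    seq_length INTEGER,
    organelle  CHAR(1)            -- 'M' mitochondrion, 'C' chloroplast
);
CREATE TABLE chloromitoperfectmicrosatellite (
    index_no     VARCHAR(11),     -- sequence ID, matches chloromitometa.seq_id
    "start"      INTEGER,
    "end"        INTEGER,
    motif        VARCHAR(10),
    iterations   INTEGER,
    tract_length INTEGER,
    coding_info  VARCHAR(50),     -- 'Coding' or NULL
    PRIMARY KEY (index_no, "start", "end")
);
""")

# Invented rows, for illustration only.
conn.execute(
    "INSERT INTO chloromitometa VALUES (?, ?, ?, ?, ?)",
    (1, "XX_000001", "Example species mitochondrion", 16000, "M"),
)
conn.executemany(
    "INSERT INTO chloromitoperfectmicrosatellite VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("XX_000001", 172, 183, "A", 12, 12, None),
        ("XX_000001", 2843, 2854, "AT", 6, 12, "Coding"),
    ],
)

# A single query joins genome metadata with repeat records.
rows = conn.execute(
    """
    SELECT m.seq_name, r.motif, r."start", r."end", r.iterations
    FROM chloromitoperfectmicrosatellite AS r
    JOIN chloromitometa AS m ON m.seq_id = r.index_no
    WHERE m.organelle = ? AND r.coding_info = 'Coding'
    ORDER BY r."start"
    """,
    ("M",),
).fetchall()
print(rows)
```

Keying each repeat on sequence ID plus start and end coordinates keeps every tract unique per genome while letting one join return metadata and repeats together.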

The information for the genome composition (A-, 
T-, G- and C- counts, etc.) has been computed from 
the flat files obtained from the NCBI RefSeq database 
(Table 1). The complete repeat information of the 
database is stored in two different tables (see 
Table 2), storing the perfect and imperfect repeats 
of all chloroplast and mitochondrial genomes. The 
repeat information includes the details of individual 
repeats such as the sequence ID, start and end coordinates 
of the repeat, the repeating motif, number of 
iterations, total tract length, nucleotide composition 
of the repeat and protein information of coding repeats. 
In addition, the table storing the imperfect 
repeats also holds the imperfection percentage and 
alignment information that can be used to study 
the evolution of these repeats. 

Table 1. Structure of the table 'chloromitometa' that stores the 
meta-information of all the mitochondrial and chloroplast 
genomes 

Information                   Field        Data type      Key   Example
Accession number              acc_no       int(11)        -     5881414, 110189662
Sequence ID                   seq_id       varchar(11)    PRI   NC_000834, AC_000022
Sequence name                 seq_name     varchar(500)   -     Rattus norvegicus strain Wistar mitochondrion, Porphyra purpurea chloroplast
Sequence type                 seq_type     varchar(50)    -     Complete genome, complete sequence
Sequence length               seq_length   int(11)        -     16 613 bp, 7686 bp
Nucleotide composition of A   a_per        Float          -     33.06%
Nucleotide composition of T   t_per        Float          -     41.87%
Nucleotide composition of G   g_per        Float          -     13.58%
Nucleotide composition of C   c_per        Float          -     11.49%
Organelle type                organelle    Char(1)        -     M (for Mitochondrion), C (Chloroplast)
Taxon ID                      taxon        Int            -     263 995

Table 2. Structure of the tables 'chloromitoperfectmicrosatellite' and 'chloromitoimperfectmicrosatellite' that store the repeat 
information of all perfect and imperfect microsatellites of mitochondrial and chloroplast genomes 

Information                               Field             Data type     Key   Example
Sequence ID                               index_no          varchar(11)   PRI   NC_000834, AC_000022
Starting co-ordinate of SSR               start             int(11)       PRI   172, 12843
Ending co-ordinate of SSR                 end               int(11)       PRI   182, 12885
Motif (repeating unit)                    motif             varchar(10)   -     AT, G, CAAC
Number of repetitions                     iterations        int(5)        -     3, 7
Length of repeat tract                    tract_length      int(11)       -     12 bp, 18 bp
Nucleotide composition of A               a_per             Float         -     50.00%
Nucleotide composition of T               t_per             Float         -     0.00%
Nucleotide composition of G               g_per             Float         -     33.33%
Nucleotide composition of C               c_per             Float         -     16.67%
Repeat position information               coding_info       varchar(50)   -     Coding (if repeat is in the coding region) or Null (if outside)
Protein ID (if repeat in coding region)   protein_id        int(11)       -     110189664 (if repeat is in the coding region) or 0 (if non-coding)
Imperfection percentage of the tract (a)  imperfection      Float         -     9%, 0%
Alignment line 1 (a)                      alignment_line1   Text          -     TTAA-TAATTAA
Alignment line 2 (a)                      alignment_line2   Text          -     **** *******
Alignment line 3 (a)                      alignment_line3   Text          -     TTAATTAATTAA

(a) The last four columns (imperfection, alignment_line1, alignment_line2 and alignment_line3) are present only in the table 
that stores imperfect microsatellites (chloromitoimperfectmicrosatellite). 
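The composition and alignment fields listed in Tables 1 and 2 can be illustrated with a minimal sketch. The function names `nucleotide_composition` and `match_line` are hypothetical, and it is an assumption that the middle alignment line simply marks identical columns with an asterisk; the example inputs are taken from the example column of Table 2, except the 12-base tract, which is constructed to reproduce the listed percentages.

```python
def nucleotide_composition(sequence):
    """Return the percentage of A, T, G and C in a sequence, rounded to two decimals."""
    sequence = sequence.upper()
    total = len(sequence)
    if total == 0:
        raise ValueError("sequence must not be empty")
    return {base: round(100.0 * sequence.count(base) / total, 2) for base in "ATGC"}


def match_line(observed, ideal):
    """Build the middle alignment line: '*' where the aligned lines agree, ' ' elsewhere."""
    if len(observed) != len(ideal):
        raise ValueError("aligned lines must have equal length")
    return "".join("*" if a == b else " " for a, b in zip(observed, ideal))


# A 12-base tract with six A, four G and two C.
print(nucleotide_composition("AAAAAAGGGGCC"))

# Observed tract (with one gap) against the ideal perfect repeat.
print(repr(match_line("TTAA-TAATTAA", "TTAATTAATTAA")))
```

Storing the three alignment lines as text keeps the position of each deviation visible, which the imperfection percentage alone does not convey.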



3.2. Web visualization of ChloroMitoSSRDB 

The front end of the database is integrated via web 
accessible PHP scripts. The web interface allows 
various patterns of search for the repeats in organelle 
genomes. The complete browsing layout of 
ChloroMitoSSRDB is displayed in Fig. 2. The curated 
information is organized into several search patterns, 
and proper navigation pages have been provided. The 
curated information from IMEx has been processed 
further according to gene IDs and organism name, and the 
SSRs were sorted according to the coding or non-coding 
regions. The position of the coding regions has been 
determined using the annotated ptt files of each 
chloroplast and mitochondrial genome as downloaded 
from the NCBI RefSeq database. 

The ChloroMitoSSRDB interface provides information on 
several repeat statistics, including the distribution of 
the repeat types, length of the motifs and their positions 
(coding or non-coding repeats). The interface 
functionalities are delivered through a query page, a 
result page and a report page, and the querying 
of ChloroMitoSSRDB through the web interface is 
organized into three search patterns: (i) the first 
search pattern is according to the organelle classification 
(chloroplast or mitochondrial genomes), 
(ii) the second search pattern is according to the type 
of repeat pattern (perfect or imperfect) and (iii) the 
last search pattern allows the 
user to select the repeat size. With the appropriate 
selection pattern, the user will be directed to the 
organelle-specific page (chloroplast or mitochondrial) 
containing the list of the organisms for which the 
SSRs have been identified, which are further linked 
to the organism-specific repeat pages for further 
information on the distribution of the repetitive tracts. 

To ease access to the database and to enhance 
user functionality, we also provide chloroplast 
and mitochondrial repeat-specific pages alphabetically 
ordered according to the organism name. An advanced 
search option has been provided to filter the repeats 
based on user-specific criteria, allowing the user 
to search for a repeat region of a specific length. An 
option to export the search results and the repeat 
information in Excel format has been provided, so that the 
user can save and analyse the repeats, design primers 
and utilize the information for further downstream 
processing of the observed repeats. 
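The search patterns and export option described above can be mimicked offline once repeat records have been downloaded. The sketch below is an illustration only: the actual interface is implemented in PHP, the record keys, function names and sample records are hypothetical, and CSV is used as a spreadsheet-readable stand-in for the Excel export.

```python
import csv
import io


def filter_repeats(repeats, organelle=None, repeat_type=None, motif_size=None, min_length=None):
    """Keep repeats that satisfy every criterion that is not None."""
    selected = []
    for rep in repeats:
        if organelle is not None and rep["organelle"] != organelle:
            continue
        if repeat_type is not None and rep["repeat_type"] != repeat_type:
            continue
        if motif_size is not None and len(rep["motif"]) != motif_size:
            continue
        if min_length is not None and rep["tract_length"] < min_length:
            continue
        selected.append(rep)
    return selected


def export_csv(repeats, fields):
    """Serialize the selected repeats as CSV text, which spreadsheet programs can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(repeats)
    return buffer.getvalue()


# Invented records, for illustration only.
records = [
    {"organelle": "C", "repeat_type": "perfect", "motif": "A",
     "start": 172, "end": 183, "tract_length": 12},
    {"organelle": "C", "repeat_type": "imperfect", "motif": "AT",
     "start": 900, "end": 913, "tract_length": 14},
    {"organelle": "M", "repeat_type": "perfect", "motif": "CAAC",
     "start": 2843, "end": 2854, "tract_length": 12},
]

# Organelle, then repeat type, then repeat size, mirroring the three search patterns.
hits = filter_repeats(records, organelle="C", repeat_type="perfect", motif_size=1)
print(export_csv(hits, ["motif", "start", "end", "tract_length"]))
```

Treating each criterion as optional lets the same function serve both the three-step browsing path and the advanced search by tract length.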

A query page for every organism is directed to an 
organism-specific ChloroMitoSSRDB repeat summary page 
that gives a detailed illustration of the distribution 
of the perfect and the imperfect repeats and of the genome 
composition using bar and pie charts. The genome 
composition and the repeat occurrence graphs were 
generated dynamically based on the repeat information 
using Libchart, a PHP chart drawing library 
(http://naku.dohcrew.com/libchart/). The repeat pattern 
summaries displayed on the organism-specific page are 
clickable links, which redirect to further 
information on the start and end of the SSR-containing 
tract, the motif and the occurrence of the respective 
repeat pattern across the genomes. 

Figure 2. How to browse: schematic browsing of ChloroMitoSSRDB. 

Mutations in the SSR stretches prevailing in the 
coding region may affect the subsequent transcription 
and translation of the gene harbouring the repetitive 
stretches of SSRs. 22 Mutation rates at chloroplast SSR 
(cpSSR) loci (between 3.2 × 10⁻⁵ and 7.9 × 10⁻⁵) 
have been described as low when 
compared with substitution rates. 23 Recently, it was 
observed that the plant mitochondrial substitution 
rates are relatively lower when compared with 
the invertebrates and mammalian mitochondrial 
genomes. 24,25 To evaluate the distribution of the SSRs 
in the coding regions, the repeat-rich regions on the or- 
ganism page have been linked to the corresponding 
protein IDs (NCBI, www.ncbi.nlm.nih.gov/), in case of 
coding repeats, which can shed light on the evolution 
of these repeated regions either through mutational 
bias or through selective forces in further ongoing work. 



4. Conclusion 

We have constructed a database, 
ChloroMitoSSRDB, that displays curated information on 
the widespread occurrence of genomic repeats in the 
chloroplast and mitochondrial genomes available so far, 
and we will be constantly updating ChloroMitoSSRDB 
with new chloroplast and mitochondrial genomes 
as and when they are released. The repeats in the 
coding regions of the genes may prove to be candidate 
markers to study the functional role of repeats associated 
with the genes, as well as possible markers for species 
delimitation, evolutionary analyses, evaluating 
germplasm and hypothesizing conservation strategies 
for endangered species. In future releases, we will 
make efforts to upgrade the primer pair information 
for the repeat-rich regions and will also upgrade the 
database with the systematic visualization of imperfect 
alignments through hyperlinked 
pages for imperfect repeats. We believe that 
ChloroMitoSSRDB will serve as a standard database 
for exploring and understanding genomic repeats in 
organelle genomes, and that the data represented in 
ChloroMitoSSRDB make a good starting point for 
further exploratory investigations on SSR polymorphism 
and large-scale comparative genomics, and provide a 
platform to understand the repetitive nature of 
organelle genomes. 